The Indonesian tafseer and translation of the Holy Quran are an important
source of information and knowledge for Indonesian Muslims, since few of them
understand the Arabic language of the Quran. However, the tafseer is full of
commentary and explanation on each surah (chapter) and/or ayah (verse), which
makes it a large document that is not easy to access. The challenge is
therefore how to consult both the tafseer and the translation quickly and
accurately, since a reader constantly needs to move back and forth between
them. This study proposes several text mining approaches, i.e. most frequent
words, K-means clustering, and association rules, to analyze an Indonesian
tafseer and translation of the Quran and to provide insights into hidden
knowledge and relationships based on statistical information derived from
them. These insights could be useful for Muslims in general and for people
doing research in related areas. The combined analysis of the approaches
shows interesting results that can help people access information in the
tafseer more efficiently. In addition, interesting relationships have been
drawn from terms in the tafseer, which could provide further and deeper
knowledge of the messages in the Quran.


1. INTRODUCTION

In recent years, natural language processing (NLP) has been widely used to automate tasks related to
translation or interpretation. Text mining is considered one of the branches of NLP, as it uses some
fundamental NLP methods but with different goals. Unlike NLP in general, which cares about the semantic
information in the text, text mining also includes methods that treat the text as a ‘bag of words’, meaning
that the semantic information is not explored. The main goal of text mining is to analyze large text
datasets, both unstructured and structured, so that one does not have to read the whole text [1]. This has
led text mining to become a valuable research area, as advances in artificial intelligence (AI) have reached
the level where the extraction of information from textual data has to be automated. The result of text
mining is information derived from the analysis of terms and words. Many large text artifacts have become
data sources for research in the text mining area. One of those large text data sources is the Holy Quran.

The Holy Quran is the most valuable book for Muslims, i.e. followers of Islam, as they believe it
contains the words of God. The Quran contains fundamental categories of knowledge which have to be
understood and recited by all Muslims [2]. The original language of the Quran is Arabic. Since many Muslims
do not understand Arabic properly, the Quran has been translated into many languages, including Indonesian,
to make its contents easier to understand. However, to some extent, a translation of the Arabic Quran alone
is not enough for the general public to really understand the exact meaning of the messages in the Quran.
That is why some honorable and knowledgeable people, called mufassir, write commentary books on the Quran,
which are called tafseer. The explanation and commentary inside a tafseer must not be based on individual
opinion. The contents of the Quran must stay the same, so all commentary must refer to the explanation from
the Prophet Muhammad [3].

As the tafseer and translation of the Quran consist of long sentences and many words, it is a
challenge to extract valuable information from both of them. In this technological era, people tend to
abandon conventional practices such as consulting a thick book by turning from one page to another. The
invention of information retrieval algorithms and text mining in NLP has enabled people to mine valuable
information from large text documents faster and automatically, and might be a possible answer to the
challenge of referencing the tafseer and translation. There are several Quran-related NLP studies, for
example [4]-[6]. However, NLP studies on the Indonesian tafseer of the Quran are rarely found, even though
this tafseer is of great importance for Muslims in understanding the contents of the Quran, especially those
with little knowledge of Arabic. Thus, this study aims to use text mining techniques to retrieve insights
from the Indonesian tafseer and to uncover hidden knowledge and relationships among the materials discussed
in the Quran.

This paper is organized as follows. Section 1 is the introduction, followed by section 2, which
reviews related works from previous research in the literature. The methodology applied in the study is
discussed in section 3, and section 4 presents the results of the research together with their discussion.
The last section is the conclusion, which summarizes the key findings and their takeaways.


2. LITERATURE REVIEW 

Generally, there are three types of research areas in text mining, i.e. techniques for preprocessing;
comparative studies of machine learning methods for classification and clustering, as well as comparisons of
feature extraction algorithms; and studies on the results of exploring a text dataset through mining. Many of
the studies on text mining in general concentrate on the preprocessing stage. This is due to the need for
further improvement in preprocessing, since it is a crucial stage which can affect the result significantly.

Preprocessing includes tokenization, normalization and substitution. Besides preprocessing, the
selection of methods is also one of the trends in the research area. Researchers usually compare two or more
common text mining methods, whether for clustering or classification [7], [8]. Another research area is the
application of text mining to a specific dataset with the focus on that dataset, such as the work by
Alhawarat et al. [9] on an Arabic language dataset and the research by Matsumoto et al. [10] on combining
numerical and text datasets. Moreover, there are also studies comparing two or more distance calculation
techniques for determining similarity in clustering or classification [11].

The Quran has also been a subject of text mining as one of the dataset sources. However, text mining
research on the Quran does not only focus on the dataset; it can also accommodate all three general types
mentioned earlier and combinations of them. Researchers can study the algorithms used for Quran text mining;
within this type of research, they can compare two or more algorithms to extract the most valuable
information from the Quran. Several text mining studies on the Quran explored and analyzed the
classification of its content, as reported in [12]-[16]. Another type of research in Quran text mining
focuses on a specific dataset and analyzes the text mining results obtained from it; in this study, the
datasets are the Indonesian tafseer and translation. There are also previous related works on text mining for
the Quran and tafseer with different goals, such as the works in [2], [5], [17], [18].

As the Quran is divided into chapters that were fixed in the past, researchers want to explore the
rules behind this division. A good example is the work in [5], whose goal was to analyze the frequent
patterns that can be found in the chapters of a Malay translated tafseer of the Quran; the techniques were
frequent pattern mining, non-trivial patterns and interesting relations. The processed dataset consisted of
6 documents and 17 terms, and the term weighting was term frequency-inverse document frequency (TF-IDF). The
three most frequent terms were “Allah”, “Muhammad”, and “wahai”. A different type of research was presented
by Khadangi et al. [4], who studied the similarity of topics in Quranic surahs using NLP methods, namely
word2vec and the accompaniment of roots in verses. The finding was that the choice of each surah's title is
based on rational logic, and that the surahs maintain an inner coherence between concepts, so that each is
formed around a single topic or a few tightly related topics [4].


Indonesian J Elec Eng & Comp Sci, Vol. 25, No. 3, March 2022: 1469-1480 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 O 1471 


An analysis of a text mining algorithm on the Quran was presented by Qi et al. [19], who looked into
the semantic information inside the Quran. The objective was to contribute to building an algorithm in the
areas of semantic analysis and automatic identification. The research semantically compared and analyzed the
Chinese and Arabic written versions of the Quran. The methods used in the study were a semantically
annotated corpus and a semantic knowledge base.

There was also a study which explored the Quran tafseer in the Malay language. The aim was to provide
an algorithm that classifies Quran tafseer verses automatically. This study by Hamoud and Atwell [18] used a
K-nearest neighbor (KNN) classifier with cosine similarity as the distance measure. The result of the study
was a contribution to Malay Quran tafseer category classification. From this study we can learn that one way
to contribute to NLP studies of the Quran is to strengthen the algorithms for building a good, tidy corpus.
Another study went in that direction and explored building a corpus and a tagging algorithm to create a
prototype which is able to extract collocations of N-gram words [17]. These N-grams consist of 2 to 6 words
from the Arabic Quran corpus, ordered by part of speech tagging. The result showed that the proposed system
allowed users to select a sequence of tags (2-6 grams) and the scope of the corpus source. In addition, a
study revealing frequent patterns in the Holy Quran (Arabic) using text mining has been reported in [20];
these patterns can be used to analyze the Quran further and bring a more comprehensive understanding. Among
the explored NLP and text mining studies related to the Quran, we have not found one which focuses on the
Indonesian tafseer of the Quran. Since Indonesia is the country with the largest number of Muslims in the
world, and not many Indonesians understand Arabic well, a technology-based approach like text mining that
can help in extracting hidden knowledge from the Quran through its tafseer will be beneficial.


3. METHODOLOGY 

Several steps were conducted in the text mining process applied in this study, as presented in
Figure 1. The whole process was conducted with the same steps for both the tafseer and the translation. The
dataset used was the KEMENAG Indonesian tafseer and translation, “tahlili” 2011 version, all Juz. The tool
used from feature selection through frequent term mining was R, with RStudio 3.6.3 as the IDE.


Figure 1. The text mining process for the Indonesian tafseer and translation of the Quran 


3.1. Preprocessing or feature selection 

The preprocessing or feature selection stage includes case folding, tokenization, word stemming and
stop word elimination. Preprocessing is needed to reduce noise, i.e. unwanted words which have no
significant meaning, before text mining. This step is also done to reduce redundancy and repetition. The
steps are reversible, and the process can go back to any step if required.
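
The four preprocessing steps can be illustrated with a minimal Python sketch. The study itself used R, so this is not the authors' code. The stop word list and the suffix rules below are hypothetical placeholders chosen only for illustration; a real pipeline would use a full Indonesian stop word list and a proper Indonesian stemmer.

```python
import re

# Hypothetical stop word list and suffix rules, for illustration only.
STOP_WORDS = {"DAN", "YANG", "DI", "KE", "DARI", "ITU", "INI"}
SUFFIXES = ("NYA", "LAH", "KAH")


def stem(token):
    """Toy stemmer: strip one known suffix if at least 4 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Case folding, tokenization, stop word elimination, then stemming."""
    folded = sentence.upper()                # case folding to uppercase
    tokens = re.findall(r"[A-Z]+", folded)   # keep runs of ASCII letters
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in kept]


sample_tokens = preprocess("Dan kitabnya itu petunjuk dari Allah")
```

The length guard in `stem` is a design choice of this sketch: it keeps short words such as "ALLAH" from being truncated by the "LAH" suffix rule.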


3.2. Feature extraction using TF-IDF 
TF-IDF is considered one of the most powerful feature extraction methods [21]. Unlike the bag of
words method, it does not only look at the most frequent terms, which would eliminate the less dominant
words; TF-IDF also weights each term based on how frequently it occurs in a document compared to how
frequently it occurs across all documents. By applying TF-IDF, the most frequent words are rescaled. The
mathematical model for TF-IDF is shown in (1) [21].

For a term i in the document j:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \qquad (1)$$

where:

$tf_{i,j}$ = number of occurrences of i in j
$df_i$ = number of documents containing i
$N$ = total number of documents
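
As a worked illustration of (1), the following Python sketch computes the weight of every term in every document. It is a minimal sketch under stated assumptions, not the R implementation used in the study: each document is assumed to be an already preprocessed list of tokens, the natural logarithm is assumed (the equation does not state a base), and the three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, following equation (1)."""
    n_docs = len(documents)
    # df[t]: number of documents that contain term t at least once
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf[t]: occurrences of term t in this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Made-up toy corpus of three tokenized "sentences".
docs = [["ALLAH", "MUSA", "FIRAUN"], ["ALLAH", "SURGA"], ["ALLAH", "MUSA", "MUSA"]]
weights = tf_idf(docs)
```

In this toy corpus the term "ALLAH" occurs in every document, so its weight is zero everywhere, while "MUSA" in the third document gets 2 × log(3/2). This shows how the weighting rescales very common terms.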


3.3. Most frequent words mining 

In this stage, the most frequent words are extracted from both the tafseer and the translation. The
most frequent words measured by TF are visualized in the form of word clouds. The frequencies measured by
TF-IDF are presented as bar plots, one each for the tafseer and the translation. Besides identifying the most
frequent words, the results are also evaluated in terms of their correlation using the Pearson correlation
coefficient.
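
The correlation step can be sketched in Python as follows. This is an illustrative sketch only: it assumes the correlation is computed over the frequencies of words that occur in both sources, and the helper names and the frequency values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shared_word_correlation(freq_a, freq_b):
    """Correlate frequencies over the words present in both dictionaries."""
    shared = sorted(set(freq_a) & set(freq_b))
    return pearson([freq_a[w] for w in shared], [freq_b[w] for w in shared])


# Made-up frequencies; "HADIS" and "KIAMAT" are ignored because each
# appears in only one of the two sources.
tafseer_freq = {"ALLAH": 10, "MUSA": 4, "SURGA": 2, "HADIS": 7}
translation_freq = {"ALLAH": 20, "MUSA": 8, "SURGA": 4, "KIAMAT": 3}
r = shared_word_correlation(tafseer_freq, translation_freq)
```

The function is undefined when one of the lists is constant, because the denominator becomes zero; a production version would need to handle that case.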


3.4. K-means clustering 
The clustering in this study was performed based on the Euclidean distance between terms or words. The
Euclidean distance is given in (2).

$$\|A - B\| = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2} \qquad (2)$$

where A and B are points in d-dimensional space such that $A = [a_1, a_2, \ldots, a_d]$ and
$B = [b_1, b_2, \ldots, b_d]$.

After the distances are obtained, the clustering method is applied. The K-means algorithm is a
partitional clustering method, meaning that the dataset is fully divided into clusters that do not overlap,
each treated as a different cluster. The first step in K-means clustering is choosing the number of clusters,
k. After that, random initial centroids for the k clusters are chosen. Each sample is then assigned to its
nearest centroid, i.e. the one with the smallest Euclidean distance from the sample, and each centroid is
updated to the mean of the samples assigned to it. The iteration is repeated until the stopping criterion is
met [22], [23].
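
The procedure described above can be sketched in plain Python. This is a minimal illustration, not the R routine used in the study. It assumes that each term is represented by a numeric vector (for example its TF-IDF weights), that the initial centroids are sampled at random from the data with a fixed seed, and that the stopping criterion is "no assignment changed"; the six two-dimensional points are made up.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points, as in equation (2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # stopping criterion: nothing changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up points forming two well separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
```

Because cluster numbers are arbitrary, only the grouping matters: the first three points share one label and the last three share the other.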

In order to obtain the best clustering results, preliminary experiments were done. One approach to
finding the optimal number of clusters k is to look for the elbow in the plot of the sum of squared errors
(SSE) against the number of cluster centers. Thus, in the K-means clustering stage, preliminary experiments
were conducted to get the best value of k before the main clustering process was done.
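
The SSE criterion can be sketched as follows. In the study the elbow is read from a plot; the `elbow_k` function below is a hypothetical automation of that reading which picks the k where the decrease in SSE slows down the most. It assumes evenly spaced k values, and the SSE numbers in the example are made up.

```python
def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )


def elbow_k(sse_by_k):
    """Pick the k with the largest drop-off in SSE improvement.

    Heuristic assumption of this sketch: the elbow is where
    (SSE[k-1] - SSE[k]) - (SSE[k] - SSE[k+1]) is largest.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up SSE values for k = 1..4; the curve flattens after k = 2.
example_sse = {1: 100.0, 2: 40.0, 3: 35.0, 4: 33.0}
chosen_k = elbow_k(example_sse)
```

When the SSE curve is nearly flat, as in noisy data, a numeric rule like this can disagree with a visual reading, so the plot remains the primary tool.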


3.5. Association rules mining 

Originally, the frequent pattern (FP) growth algorithm was used for finding association rules in
relational databases of transactions. The formal definition of an association rule was presented by Agrawal
et al. [24] as follows. Let I = {I1, I2, ..., Im} be a set of items or binary attributes. Let D be the set of
all transactions, where each transaction T is a set of items such that T ⊆ I. Let X and Y be sets of items
such that X, Y ⊆ I. From those definitions, an association rule is an implication of the form X ⇒ Y, where
X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ [24].

When dealing with association rules, two values need to be analyzed: support and confidence. For
support, if s% of the transactions in D contain X ∪ Y, then the association rule X ⇒ Y has support s. For
confidence, if c% of the transactions in D that contain X also contain Y, then the association rule X ⇒ Y has
confidence c. Association rules mining can also be used to capture positive and negative associations among
items based on their frequency of appearance, even though most association rules tend to be positive
associations [25].
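
The two measures can be computed directly from their definitions, as in the Python sketch below. It is an illustration only and does not implement FP-growth or any rule mining algorithm. Each transaction is assumed to be the set of terms in one document, and the four transactions are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing X, the fraction also containing Y."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up transactions: each set holds the terms of one document.
transactions = [
    {"ALLAH", "PETUNJUK"},
    {"ALLAH", "KAFIR"},
    {"ALLAH", "PETUNJUK", "JALAN"},
    {"SURGA"},
]
rule_support = support(transactions, {"ALLAH", "PETUNJUK"})
rule_confidence = confidence(transactions, {"ALLAH"}, {"PETUNJUK"})
```

Here the rule {ALLAH} ⇒ {PETUNJUK} has support 2/4 and confidence 2/3, since three transactions contain "ALLAH" and two of those also contain "PETUNJUK". The `confidence` function divides by zero when the antecedent never occurs, which a fuller implementation would guard against.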


4. RESULTS AND DISCUSSION 

This section presents the results of the conducted research and discussion related to it. There are five 
sub-sections here, where each sub-section discusses results from each step performed in the text mining of 
the tafseer and translation of Quran. The five sub-sections are preprocessing results, feature extraction 
results, most frequent words mining results, K-means clustering results, and association rules results. 


4.1. Preprocessing or feature selection results

After feature selection, the datasets no longer form meaningful sentences, as some words have been
taken away. Figure 2 shows samples of the results of the feature selection stage for the tafseer in
Figure 2(a) and the translation in Figure 2(b). It can be seen from the samples that each word has been
tokenized, case folded into uppercase, and stemmed. The data in Figure 2 also show that preprocessing the
tafseer and the translation yields some different words and some similar words. Further processes, such as
clustering (in Section 4.4) and association rules (in Section 4.5), will show what can be revealed from those
differences and similarities.


SURAH      PENDAPAT   ULAMA      ABU        AHLI       ALLAH      AYAT
134.71901  134.11199  66.08931   158.08545  98.83268   808.36051  553.46830
MEDINAH    SALAT      HADIS      NABI       ALBUKHARI  MUSLIM     RIWAYAT
50.42218   196.86536  362.28900  527.68617  115.59964  154.48807  218.49652
UMAR       MEKAH      ABBAS      RASULULLAH AHMAD      HURAIRAH   SAHABAT
75.60364   125.21297  75.93314   393.16143  76.56092   82.72732   63.75376
ALQURAN    MELARANG   MANUSIA    MUHAMMAD   PERINTAH   TUHAN      WAHYU
322.96000  58.01810   437.15789  342.15315  236.57726  332.61349  87.30771
(a)

DOSA        NYATA       BESAR       KIAMAT      KITAB       MENGIKUTI   TAKUT       MENDUSTAKAN DUNIA
107         139         175         174         249         110         146         138         181
KAFIR       LAKILAKI    RAHMAT      KEHIDUPAN   AIR         MALAM       SURGA       KEKAL       MELIHAT
414         136         118         177         143         116         164         102         232
PAHALA      NERAKA      CIPTA       MATI        REZEKI      TAQWA       BERPALING   MENGATAKAN  KEBAJIKAN
101         274         162         119         109         221         108         115         129
MENDENGAR   HATI        AKHIRAT     ANAK        GOLONGAN    KAUM        FIRAUN      MUSA        NEGERI
119         194         139         150         120         207         111         235         128
SALAT       HARTA       FIRMAN      PERBUATAN   MALAIKAT    IBRAHIM     SETAN       PEREMPUAN
100         133         154         107         133         126         110         176
(b)


Figure 2. Samples of preprocessing or feature selection results of tafseer and translation, (a) feature selected 
samples of tafseer and (b) feature selected sample translation 


4.2. Feature extraction results 

The TF-IDF algorithm was used in the feature extraction process. Table 1 shows the matrix properties
of the term document matrices (TDM) of the tafseer and translation datasets. The tafseer contains 488
significant terms for the TF-IDF calculation, while the translation has 116 terms. These terms are presented
as the columns of the term document matrix, and the occurrence of each term is weighted for each document.
The total number of documents, in this case sentences, was 18450 for the tafseer and 6234 for the
translation. The non-sparse entries of each matrix are the nonzero entries and the sparse entries are the
zero entries. The maximal length was 14 words per document for the tafseer and 13 words per document for the
translation. The visualizations of the word TF-IDF values are presented in Figure 3 for both the tafseer in
Figure 3(a) and the translation in Figure 3(b). The two figures show similar curves for the TF-IDF values.
There are around 4 words or terms whose values differ significantly from the others. Further discussion of
those numbers is presented in the next section, i.e. most frequent words mining results.


4.3. Most frequent words mining results 

Since TF-IDF was used in the feature extraction stage to weight the term frequencies, the most
frequent words mining is another automatic result of the TF-IDF algorithm. Figure 4 shows bar plots of the
30 most frequent words in the tafseer (Figure 4(a)) and the translation (Figure 4(b)), measured by TF-IDF.
Based on the TF-IDF definition, those words are the most likely to appear in each sentence of the tafseer and
translation. Previous work studying frequent items in a Malay tafseer of the Quran [6] reported 17 words that
frequently appeared in the tafseer: aku, Allah, apabila, berlindung, katakanlah, kejahatan, makhluk, manusia,
masuk, menguasai, Muhammad, orang, pula, sekalian, tuhan, ugama, and wahai. The study also reported that
“Allah”, “Muhammad”, and “wahai” were the most frequent among those 17 items. Compared with our results in
Figure 4, some words appear in both: Allah, Muhammad, tuhan, manusia, and agama (note: ugama in Malay). Only
those five words are found in both works. This indicates the importance of those five words in the Quran and
its tafseer across different languages. For the words that are not in the intersection, the difference could
be due to the different ways of explaining the meaning of the ayah, which led to different word usage.

Results in Figure 4 present the frequent words in the tafseer (Figure 4(a)) and the translation
(Figure 4(b)). In order to know the level of correlation between the tafseer and the translation, the Pearson
correlation coefficient needs to be calculated. The correlation is observed on the words shared by the
tafseer and the translation, to see whether the pattern is the same or not. The observation concerns how
strongly the frequency of a particular word in the tafseer and its frequency in the translation tend to move
together.


Table 1. The matrix property of TDM of the tafseer and translation

Data source   Terms   Documents   Non-sparse entries   Sparse entries   Maximal length
Tafseer       488     18450       234693               8768907          14
Translation   116     6234        18815                704329           13


[Figure 3: panels (a) and (b), each titled "Word TF-IDF frequencies", with an axis labelled "Terms".]

Figure 3. TF-IDF term frequency plots of the tafseer and translation, (a) tafseer TF-IDF plot and
(b) translation TF-IDF plot


[Figure 4: two horizontal bar plots, each titled "Term frequencies", with the terms on the vertical axis and
"Frequency" on the horizontal axis (0 to 1500 in (a), 0 to 400 in (b)).
Terms in (a), in extracted order: FIRMAN, ALLAH, [illegible], NABI, MANUSIA, RASULULLAH, SABDA, HADIS,
MUHAMMAD, KAUM, TUHAN, PERBUATAN, KEBAIKAN, ALQURAN, IMAN, BUMI, MEMBERI, AGAMA, KAFIR, RASUL, KEBENARAN,
MENGETAHUI, IBRAHIM, PERINTAH, RIWAYAT, DUNIA, MUSA, AZAB, NERAKA, SALAT.
Terms in (b), in extracted order: [partly illegible, ending in "LAH"], TUHAN, IMAN, MUHAMMAD, KEBENARAN,
AZAB, MANUSIA, BUMI, ALQURAN, KAFIR, NERAKA, KEBAIKAN, LANGIT, RASUL, [illegible], MEMBERI, MENDUSTAKAN,
KIAMAT, MELIHAT, PERINGATAN, MUSA, PETUNJUK, SURGA, JALAN, KEHENDAK, KAUM, CIPTA, KITAB, BESAR, AYAT.]

Figure 4. The 30 most frequent words in the tafseer and translation, (a) most frequent words in tafseer and
(b) most frequent words in translation

The Pearson correlation coefficient obtained is 0.5306. This value indicates a moderate positive
correlation between the frequencies of words occurring in both the tafseer and the translation. It means that
there is a tendency that the higher the frequency of a word in the tafseer, the higher the frequency of that
word in the translation, and vice versa. Thus, even though there are some differences in the most frequent
words between the tafseer and the translation, there is a tendency for the same words to occur in both of
them. This information is beneficial in ensuring that the tafseer and translation of this version point in
the same direction. In other words, one can trust references from both the tafseer and the translation of
this version because their word patterns are consistent.


4.4. K-means clustering results 

In the K-means clustering phase, the initial step was to determine the K value to be used in the
clustering process. The best K value was determined based on the SSE evaluation. Figure 5 shows the SSE
cluster center plots for the tafseer in Figure 5(a) and the translation in Figure 5(b). As can be seen from
the graphs in Figure 5, the best K value is K=10 for the tafseer and K=8 for the translation. Thus, this
study focuses on analyzing the results of 8 and 10 clusters for both datasets.


[Figure 5: two panels, each titled "SSE by Cluster Center Plot", with "Cluster Centers" (2 to 20) on the
horizontal axis; the vertical axis runs from about 16800 to 17400 in (a) and from about 5200 to 5500 in (b).]


Figure 5. SSE vs cluster centers plot for tafseer and translation, (a) tafseer and (b) translation 


Table 2 shows the clustering results with K=8 for the tafseer and translation. It should be noted that
the cluster numbers are not ordered and carry no meaning in themselves. Cluster 1 of the tafseer contains the
words “kitab”, “anak”, “sihir”, “mesir”, “agama”, “harun”, “Tuhan”, “Firaun”, “Bani Israil”, and “Musa”. This
is a good example of a coherent cluster: in the Quran, the story of the Prophet Musa is narrated. The story is
about the duty of the Prophet Musa, accompanied by the Prophet Harun, to remind Firaun. The place was Egypt,
or “Mesir”, where bad magic, or “sihir”, was popular at that time.


Table 2. Clustering results with K=8 clusters

Cluster | Tafseer | Translation
1 | Kitab, Anak, Mukjizat, Sihir, Mesir, Agama, Harun, Tuhan, Firaun, Israil, Bani, Musa | Petunjuk, Rasul, Benar, Mengingkari, Kafir, Azab, Mendustakan, Kitab, Quran
2 | Hati, Peringatan, Ajaran, Hukum, Petunjuk, Agama, Kitab, Quran | Kitab, Kiamat, Peringatan, Rasul, Musa, Kafir, Neraka, Azab
3 | Tanah, Planet, Gunung, Siang, Hujan, Kekuasaan, Bulan, Malam, Benda, Alam, Tanda, Matahari, Air, Menciptakan, Bintang, Mahluk, Malaikat, Langit, Bumi | Baik, Kebesaran, Hujan, Tanda, Air, Gunung, Langit, Bumi
4 | Tirmidzi, Imam, Ibnu, Ismail, Ahmad, Bukhari, Hurairah, Abu | Karunia, Kafir, Hati, Petunjuk, Muhammad, Hamba, Rasul, Beriman
5 | Balasan, Pahala, Kehidupan, Berhala, Dosa, Hamba, Kafir, Nikmat, Neraka, Surga, Amal, Azab, Akhirat, Dunia | Kenikmatan, Bertaqwa, Kebajikan, Kekal, Petunjuk, Lurus, Sungai, Surga, Manusia
6 | Hati, Peringatan, Ajaran, Hukum, Petunjuk, Agama, Kitab, Quran | Petunjuk, Janji, Jalan, Azab, Firaun, Kitab, Rasul, Muhammad, Tanda, Benar
7 | Dosa, Kiamat, Perempuan, Laki, Tempat, Kafir | Berdoa, Rahmat, Quran, Azab, Pengampun, Pengasih, Penyayang
8 | Istidraj, Istiadat, Israfil, Israil, Istana, Istilah, Isteri, Petunjuk, Malaikat | Beriman, Istri, Azab, Akhirat, Yatim, Dunia, Harta, Nikmat, Perempuan, Laki

Looking at the members of cluster 3 in the tafseer, this cluster contains astronomical terms, for
example “Planet”, “Bulan”, “Bintang”, “Matahari”, “Langit” and “Bumi”. This cluster might be a partition
about the Quranic perspective on the creation of the universe. Another interesting cluster in the tafseer is
cluster 4, which contains the names of hadith narrators and also makes a good cluster. The words in cluster 5
of the tafseer are quite interesting, as it contains pairs of opposite words, such as “neraka” and “surga”,
“hamba” and “kafir”, or “pahala” and “dosa”. It can be seen here that in the tafseer, the bad and the good
are narrated together, so they are close to each other and fall into one cluster. Cluster 6 is also a good
example of a single cluster theme because of the closeness of its topics, namely the Quran and Kitab as laws
for Muslims, as already discussed in the literature review.

The clustering results for the translation are not as clear as those for the tafseer. Some of the
same words appear in several clusters, for example “Azab” and “Petunjuk”, which makes it difficult to decide
the main topic of each cluster. What can be learned from this is that the words in the translation are not
far from each other in distance. This means that, under the TF-IDF weighting method, the terms most likely
appear about the same number of times in each document.

Next, the results for K=10 presented in Table 3 were observed. For the tafseer, five clusters are
similar to the previous result. For the translation, the results start to get clearer for some clusters. For
example, the words “mata”, “air”, “balasan”, “baik”, “taman”, “buah”, “penghuni”, “kenikmatan”, “mengalir”,
“kekal”, “sungai”, and “surga” fall into one cluster in the translation. Overall, however, determining the
theme of each translation cluster is still not easy.


Table 3. Clustering results with K=10 clusters

Cluster | Tafseer | Translation
1 | Munafik, Larangan, Kemenangan, Musuh, Yahudi, Ibrahim, Mekah, Perang, Musyrik, Kafir | Hati, Langit, Hamba, Tobat, Bumi, Quran, Muhammad, Beriman
2 | Istidraj, Isteri, Israfil, Israil, Istana, Petunjuk, Malaikat | Gembira, Perjalanan, Kafir, Celakalah, Manusia, Beriman, Kebenaran, Peringatan
3 | Negeri, Tanda, Nuh, Setan, Nikmat, Hati, Kiamat, Kafir, Neraka | Malaikat, Hamba, Benar, Sahaya, Istri, Anak, Perempuan, Laki
4 | Nikmat, Hamba, Quran, Ajaran, Surga, Umat, Baik, Kesenangan, Kebahagiaan, Neraka, Kafir, Kehidupan, Azab, Hidup, Akhirat, Dunia | Sapi, Takut, Bumi, Malam, Firaun, Harun, Kekuasaan, Kebesaran, Tanda
5 | Isa, Hud, Esa, Musyrik, Sembah, Berhala, Patung, Tuhan | Diutus, Azab, Umat, Yatim, Nuh, Harta, Anak, Rasul
6 | Dawud, Abdullah, Umar, Tirmidzi, Imam, Ahmad, Ibnu, Bukhari, Muslim, Abu, Hurairah, Sabda | Mata, Air, Balasan, Baik, Taman, Buah, Penghuni, Kenikmatan, Mengalir, Kekal, Sungai, Surga
7 | Jalan, Kebajikan, Buruk, Balasan, Sifat, Isteri, Ibu, Surga, Saleh, Pahala, Perempuan, Laki, Hamba, Harta, Amal, Dosa, Anak | Air, Golongan, Dunia, Gunung, Waktu, Negeri, Muhammad, Malaikat, Baik, Kiamat, Manusia, Azab, Neraka
8 | Harun, Hati, Sihir, Umat, Petunjuk, Kaum, Mukjizat, Kebenaran, Taurat, Firaun, Bani, Israil, Kitab, Musa | Petunjuk, Nikmat, Muhammad, Dustakan, Azab, Kafir
9 | Tumbuhan, Planet, Bulan, Kekuasaan, Tanda, Benda, Tanah, Gunung, Alam, Matahari, Hujan, Cipta, Mahluk, Binatang, Air, Langit, Bumi | Azab, Kafir, Kerajaan, Tanda, Janji, Rasul, Bumi, Langit, Besar
10 | Ibrahim, Menyampaikan, Hamba, Mahluk, Utusan, Laki, Wahyu, Jibril, Lut, Adam, Malaikat | Puji, Zalim, Disembah, Esa, Langit, Bumi, Azab, Pengasih


The K-means clustering results have shown that with K=8 the clusters created from the tafseer
converge to obvious themes. The case for the translation was different: the created clusters did not show
clear themes. The major reason the tafseer shows clear grouping in each cluster is that a tafseer usually
narrates and describes similar topics as one story, such as the story of Musa and Firaun, the Quran as the
law of Muslims, and astronomical objects. The translation does not have this kind of structure. It mostly
just translates the text of each ayah in a surah from Arabic to Indonesian, which does not always converge to
similar topics.


4.5. Association rules results 

This section presents interesting associations found within the tafseer and the translation. 
Figure 6 shows the associations of the word “Allah” in the translation dataset. Except for the word “kafir”, 
all of the associated words carry positive sentiment. High support values are found for the association of 
“Allah” with “memberi” (“to give”), “petunjuk” (“guidance”), and “penyayang” (“loving”). The association 
with the word “kafir” (“non-believer”) has a support of 0.004 and a confidence of 0.957. That is, the terms 
“Allah” and “kafir” occur together in 0.4% of all translation documents, and 95.7% of the translation 
documents that contain the term “Allah” also contain “kafir”. To interpret this association, the corresponding 
passages are looked up in the translation dataset. One sample of this association is Surah An-Nahl, Ayah 
106-107. These ayahs show that Allah narrates the bad fate that will come to the kafir, the people who deny 
the truth of Allah. 
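
The two measures can be made concrete with a short sketch. It assumes each document (for example, one ayah) is reduced to a set of terms; the five toy documents and the function names are hypothetical and do not reproduce the values reported above.

```python
def support(docs, terms):
    """Fraction of documents that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for d in docs if terms <= d) / len(docs)

def confidence(docs, antecedent, consequent):
    """Among documents containing the antecedent, the share that also contain the consequent."""
    base = support(docs, antecedent)
    if base == 0:
        return 0.0
    return support(docs, set(antecedent) | set(consequent)) / base

# Hypothetical toy documents, each already reduced to a set of terms.
docs = [
    {"allah", "memberi", "petunjuk"},
    {"allah", "kafir", "azab"},
    {"allah", "penyayang"},
    {"kafir", "allah"},
    {"langit", "bumi"},
]

print(support(docs, {"allah", "kafir"}))        # co-occurrence share
print(confidence(docs, {"kafir"}, {"allah"}))   # rule kafir -> allah
print(confidence(docs, {"allah"}, {"kafir"}))   # rule allah -> kafir
```

Support is symmetric in the two terms, whereas confidence depends on which term is taken as the antecedent, so a reported confidence should always be read together with the direction of its rule.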

In addition, a word can have different support and confidence values depending on the words it is 
associated with. For example, when the word “jalan” is associated only with the word “Allah”, the support 
and confidence values are 0.005 and 0.810. When “jalan” is associated with both “Allah” and “kebaikan”, 
those values are 0.003 and 0.833. The word “jalan” is also associated with “kafir”, with values of 0.04 and 
0.957. The same pattern applies to some other words. 
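
The “jalan” figures are consistent with the standard definitions of the two measures, written out below for a collection of N documents. The direction of the rules is an assumption here: “jalan” (and “kebaikan”) is taken as the antecedent and “Allah” as the consequent, which the text does not state explicitly.

```latex
\[
\operatorname{supp}(X) = \frac{|\{\, d : X \subseteq d \,\}|}{N},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} .
\]
% Adding a term to an itemset can never increase its support:
\[
\operatorname{supp}(\{\text{jalan}, \text{Allah}, \text{kebaikan}\}) = 0.003
\;\le\;
\operatorname{supp}(\{\text{jalan}, \text{Allah}\}) = 0.005 .
\]
% Antecedent supports implied by the reported (rounded) values, under the assumed direction:
\[
\operatorname{supp}(\{\text{jalan}\}) \approx \frac{0.005}{0.810} \approx 0.0062,
\qquad
\operatorname{supp}(\{\text{jalan}, \text{kebaikan}\}) \approx \frac{0.003}{0.833} \approx 0.0036 .
\]
```

Support can only fall when a term is added, while confidence may rise or fall, because the numerator and the denominator shrink by different amounts. This is why the rule with the additional term “kebaikan” has lower support but slightly higher confidence.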


[Figure 6: association graph; recoverable node labels: KAFIR, MEMBERI, KEHENDAK, PENYAYANG, FIRMAN] 

Figure 6. A result of association rules process on translation dataset 


For the tafseer, Figure 7 shows the results of the association rules. The words “israil”, “musa”, 
and “bani” share the same support and confidence values: 0.1% of all documents contain the three words 
together, and the confidence is 54%. This also relates to the previous clustering result, in which cluster 1 
contains these words together with the word “Harun”. 

Both sets of association rule results relate to the previous results from frequent pattern mining and 
clustering. A typical information retrieval sequence is as follows. After learning that “Musa” is one of the 
most frequent words in the tafseer and translation, one can find the “Musa” cluster in the clustering result. 
The associations of each word in that cluster can then be examined using the association rules. Following this 
sequence reveals that the Prophet Musa carried out a duty from Allah to remind Firaun, that Musa then asked 
Allah Azza Wa Jalla to let his brother, Prophet Harun, accompany him in this duty, and that these events took 
place in Mesir, which is now Egypt. 

Several benefits can be drawn from these association rule results. The first is to enable Islamic 
scholars and Muslims to discover connections within a topic that they would like to study further. For 
example, suppose one wants to learn about Prophet Musa by referring to the Indonesian tafseer. Without the 
association rules, he or she might focus only on the word “Musa” and have to read every sentence about 
“Musa” in the tafseer to draw out valuable information about Prophet Musa. With the list of association rules 
for the word “Musa”, insight can be gained faster. For instance, using (“Musa”, “Mesir”) -> “Agama” or 
(“Musa”, “Bani Israil”) -> “Agama”, one can concentrate on those words when searching for information 
about Prophet Musa. 
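
One way to use such a rule for focused reading is to keep only the passages that mention every term of the rule. The sketch below assumes simple case-insensitive substring matching; the passages are hypothetical placeholders rather than text from the tafseer, and the function name is illustrative.

```python
def passages_matching_rule(passages, antecedent, consequent):
    """Return the indices of passages that mention every term of the rule."""
    wanted = {t.lower() for t in antecedent} | {t.lower() for t in consequent}
    hits = []
    for i, text in enumerate(passages):
        lowered = text.lower()
        # Substring matching keeps multi-word terms such as "Bani Israil" intact.
        if all(term in lowered for term in wanted):
            hits.append(i)
    return hits

# Hypothetical placeholder passages, not quotations from the tafseer.
passages = [
    "Musa diutus kepada Firaun di Mesir untuk menyampaikan agama",
    "Musa dan Harun",
    "Langit dan bumi",
    "Bani Israil mengikuti Musa dalam agama",
]

print(passages_matching_rule(passages, ["Musa", "Mesir"], ["Agama"]))
print(passages_matching_rule(passages, ["Musa", "Bani Israil"], ["Agama"]))
```

Each rule narrows the reading list to a different subset of the passages that mention “Musa”, which is the time saving described above. A real system would likely match on stemmed tokens rather than raw substrings.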

Another benefit is in business, specifically for online bookstores or libraries. Suppose a user of an 
online bookstore or library is interested in books tagged with the keyword “Musa”. The system could then 
suggest books that might interest the user and build a book preference profile for that user. Because the word 
“Musa” in the tafseer has association rules with “Mesir” and/or “Bani Israil”, the system can recommend 
books tagged with the words “Mesir” and/or “Bani Israil”. Doing so would of course require further 
processing, but this illustrates in general how association rules can provide assistance in the business area. 
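
A minimal sketch of such a recommendation step follows. It assumes the rules are available as (antecedent, consequent, confidence) triples; the confidence values, the threshold, and the function name are hypothetical and only illustrate the idea.

```python
def recommend_tags(rules, viewed_tags, min_confidence=0.5):
    """Suggest tags implied by rules whose antecedent is covered by the viewed tags."""
    viewed = {t.lower() for t in viewed_tags}
    scores = {}
    for antecedent, consequent, conf in rules:
        if conf < min_confidence:
            continue  # ignore weak rules
        if set(antecedent) <= viewed:
            for tag in consequent:
                if tag not in viewed:
                    # Keep the strongest rule that leads to each tag.
                    scores[tag] = max(scores.get(tag, 0.0), conf)
    # Strongest suggestions first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))

# Hypothetical rules: (antecedent terms, consequent terms, confidence).
rules = [
    (("musa",), ("mesir",), 0.60),
    (("musa",), ("bani israil",), 0.54),
    (("musa",), ("langit",), 0.10),
    (("quran",), ("islam",), 0.70),
]

print(recommend_tags(rules, ["Musa"]))
```

Books carrying the suggested tags would then be candidates for recommendation. Ranking by confidence is only one possible design choice; a production system might also weigh support or user history.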


[Figure 7: association graph; recoverable node labels: BANI, MESIR, MUSA, MENINGGALKAN, ISLAM, QURAN] 

Figure 7. A result of association rules process on tafseer dataset 


5. CONCLUSIONS 

This study has applied text mining to an Indonesian tafseer and translation of the Quran through 
several approaches, i.e. most frequent words, K-means clustering, and association rules. Valuable 
information was obtained from the tafseer and translation from the text mining perspective. The 30 most 
frequent words in the tafseer and in the translation were presented, and 17 words are common to both top-30 
rankings. The correlation between the frequencies of these common words in the tafseer and in the translation 
is 0.5306, meaning there is a tendency that the more frequently a word occurs in the tafseer, the more 
frequently it occurs in the translation, and vice versa. This result suggests that this version of the tafseer and 
translation most likely convey information with similar meaning and that both datasets pose few natural 
language processing problems. The clustering results for the tafseer and translation were then obtained using 
the K-means technique. The best partition was obtained for the tafseer with K=8 clusters, whereas the 
partition of the translation dataset was not as good. The clustering result for the tafseer addresses the 
unstructured dataset problem: by knowing the clusters, one can distinguish topics based on their similarities, 
so information retrieval can be more relevant and efficient. Furthermore, the association rule results show 
some interesting relations between terms in the tafseer and translation. The results of the three approaches 
support one another. For example, frequent word mining surfaces the word “Musa”, which appears in one of 
the well-separated clusters together with “Firaun” and other words. The associations among those words can 
then be seen and measured through the association rule results. A sample rule in the tafseer is (“Musa”, 
“Bani”, “Israil”) -> “Agama”. Together, these show that combining the three approaches could lead to ways 
of retrieving more information and revealing more insight from the Quran and its tafseer. 